Explore and Summarize Data - Prosper Loan

### by Rami Salman

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

I will ask my own questions about this data set. There are MANY variables in this data set and I will explore between 10- 15 variables in the this analysis.

1 - data cleaning:

1.1 - drop the keys and ids columns because no need of them in the analysis

process , this keys are: ListingKey, ListingNumber,LoanKey,MemberKey .

1.2 - now i will search about n/a valus:

##                 ListingCreationDate                         CreditGrade 
##                                   0                                   0 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               29084                               29084 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               29084                               29084 
##               ProsperRating..Alpha.                        ProsperScore 
##                                   0                               29084 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                   0                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7625                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                                   0 
##                    DateCreditPulled               CreditScoreRangeLower 
##                                   0                                 591 
##               CreditScoreRangeUpper             FirstRecordedCreditLine 
##                                 591                                   0 
##                  CurrentCreditLines                     OpenCreditLines 
##                                7604                                7604 
##          TotalCreditLinespast7years               OpenRevolvingAccounts 
##                                 697                                   0 
##         OpenRevolvingMonthlyPayment                InquiriesLast6Months 
##                                   0                                 697 
##                      TotalInquiries                CurrentDelinquencies 
##                                1159                                 697 
##                    AmountDelinquent             DelinquenciesLast7Years 
##                                7622                                 990 
##            PublicRecordsLast10Years           PublicRecordsLast12Months 
##                                 697                                7604 
##              RevolvingCreditBalance                 BankcardUtilization 
##                                7604                                7604 
##             AvailableBankcardCredit                         TotalTrades 
##                                7544                                7544 
##  TradesNeverDelinquent..percentage.             TradesOpenedLast6Months 
##                                7544                                7544 
##                   DebtToIncomeRatio                         IncomeRange 
##                                8554                                   0 
##                    IncomeVerifiable                 StatedMonthlyIncome 
##                                   0                                   0 
##                   TotalProsperLoans          TotalProsperPaymentsBilled 
##                               91852                               91852 
##               OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate 
##                               91852                               91852 
##     ProsperPaymentsOneMonthPlusLate            ProsperPrincipalBorrowed 
##                               91852                               91852 
##         ProsperPrincipalOutstanding         ScorexChangeAtTimeOfListing 
##                               91852                               95009 
##           LoanCurrentDaysDelinquent       LoanFirstDefaultedCycleNumber 
##                                   0                               96985 
##          LoanMonthsSinceOrigination                          LoanNumber 
##                                   0                                   0 
##                  LoanOriginalAmount                 LoanOriginationDate 
##                                   0                                   0 
##              LoanOriginationQuarter                  MonthlyLoanPayment 
##                                   0                                   0 
##                 LP_CustomerPayments        LP_CustomerPrincipalPayments 
##                                   0                                   0 
##                  LP_InterestandFees                      LP_ServiceFees 
##                                   0                                   0 
##                   LP_CollectionFees               LP_GrossPrincipalLoss 
##                                   0                                   0 
##                 LP_NetPrincipalLoss     LP_NonPrincipalRecoverypayments 
##                                   0                                   0 
##                       PercentFunded                     Recommendations 
##                                   0                                   0 
##          InvestmentFromFriendsCount         InvestmentFromFriendsAmount 
##                                   0                                   0 
##                           Investors 
##                                   0

as a result of the previous step , their is a lot of columns with very large number of n/a (more than 2000 instances! ) , i will drop theis columns. theis columns includs: EstimatedEffectiveYield, EstimatedLoss , EstimatedReturn , ProsperRating..Alpha. , CurrentCreditLines , OpenCreditLines , AmountDelinquent, PublicRecordsLast12Months , RevolvingCreditBalance, BankcardUtilization , AvailableBankcardCredit , TotalTrades , TradesNeverDelinquent..percentage., TradesOpenedLast6Months ,DebtToIncomeRatio, TotalProsperLoans, TotalProsperPaymentsBilled , OnTimeProsperPayments , ProsperPaymentsLessThanOneMonthLate , ProsperPaymentsOneMonthPlusLate , ProsperPrincipalBorrowed , ProsperPrincipalOutstanding , ScorexChangeAtTimeOfListing,OpenCreditLines, CurrentCreditLines, CreditScoreRangeUpper,EstimatedReturn,EstimatedLoss, LoanFirstDefaultedCycleNumber.

##                 ListingCreationDate                         CreditGrade 
##                                   0                                   0 
##                                Term                          LoanStatus 
##                                   0                                   0 
##                          ClosedDate                         BorrowerAPR 
##                                   0                                  25 
##                        BorrowerRate                         LenderYield 
##                                   0                                   0 
##             EstimatedEffectiveYield                       EstimatedLoss 
##                               29084                               29084 
##                     EstimatedReturn             ProsperRating..numeric. 
##                               29084                               29084 
##               ProsperRating..Alpha.                        ProsperScore 
##                                   0                               29084 
##           ListingCategory..numeric.                       BorrowerState 
##                                   0                                   0 
##                          Occupation                    EmploymentStatus 
##                                   0                                   0 
##            EmploymentStatusDuration                 IsBorrowerHomeowner 
##                                7625                                   0 
##                    CurrentlyInGroup                            GroupKey 
##                                   0                                   0 
##                    DateCreditPulled               CreditScoreRangeLower 
##                                   0                                 591 
##               CreditScoreRangeUpper             FirstRecordedCreditLine 
##                                 591                                   0 
##                  CurrentCreditLines                     OpenCreditLines 
##                                7604                                7604 
##          TotalCreditLinespast7years               OpenRevolvingAccounts 
##                                 697                                   0 
##         OpenRevolvingMonthlyPayment                InquiriesLast6Months 
##                                   0                                 697 
##                      TotalInquiries                CurrentDelinquencies 
##                                1159                                 697 
##                    AmountDelinquent             DelinquenciesLast7Years 
##                                7622                                 990 
##            PublicRecordsLast10Years           PublicRecordsLast12Months 
##                                 697                                7604 
##              RevolvingCreditBalance                 BankcardUtilization 
##                                7604                                7604 
##             AvailableBankcardCredit                         TotalTrades 
##                                7544                                7544 
##  TradesNeverDelinquent..percentage.             TradesOpenedLast6Months 
##                                7544                                7544 
##                   DebtToIncomeRatio                         IncomeRange 
##                                8554                                   0 
##                    IncomeVerifiable                 StatedMonthlyIncome 
##                                   0                                   0 
##                   TotalProsperLoans          TotalProsperPaymentsBilled 
##                               91852                               91852 
##               OnTimeProsperPayments ProsperPaymentsLessThanOneMonthLate 
##                               91852                               91852 
##     ProsperPaymentsOneMonthPlusLate            ProsperPrincipalBorrowed 
##                               91852                               91852 
##         ProsperPrincipalOutstanding         ScorexChangeAtTimeOfListing 
##                               91852                               95009 
##           LoanCurrentDaysDelinquent       LoanFirstDefaultedCycleNumber 
##                                   0                               96985 
##          LoanMonthsSinceOrigination                          LoanNumber 
##                                   0                                   0 
##                  LoanOriginalAmount                 LoanOriginationDate 
##                                   0                                   0 
##              LoanOriginationQuarter                  MonthlyLoanPayment 
##                                   0                                   0 
##                 LP_CustomerPayments        LP_CustomerPrincipalPayments 
##                                   0                                   0 
##                  LP_InterestandFees                      LP_ServiceFees 
##                                   0                                   0 
##                   LP_CollectionFees               LP_GrossPrincipalLoss 
##                                   0                                   0 
##                 LP_NetPrincipalLoss     LP_NonPrincipalRecoverypayments 
##                                   0                                   0 
##                       PercentFunded                     Recommendations 
##                                   0                                   0 
##          InvestmentFromFriendsCount         InvestmentFromFriendsAmount 
##                                   0                                   0 
##                           Investors 
##                                   0

in the previous step i droped the columns with high number of n/a , so it remains some cells with n/a’s now i will drop the n/a rows and save them in new record called new_df

after the previous cleaning steps, our new data frame contains 50 feature and 8453 observation we will work on them. The new data set contains data from 2009.

2- Univariate Analysis

2.1- What is the structure of your dataset?

## 'data.frame':    84834 obs. of  52 variables:
##  $ ListingCreationDate            : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 111894 64760 85967 100310 72556 74019 97834 97834 54939 100485 ...
##  $ CreditGrade                    : Factor w/ 9 levels "","A","AA","B",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Term                           : int  36 36 36 60 36 36 36 36 60 36 ...
##  $ LoanStatus                     : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 4 4 4 4 4 4 4 4 8 ...
##  $ ClosedDate                     : Factor w/ 2803 levels "","2005-11-25 00:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BorrowerAPR                    : num  0.12 0.125 0.246 0.154 0.31 ...
##  $ BorrowerRate                   : num  0.092 0.0974 0.2085 0.1314 0.2712 ...
##  $ LenderYield                    : num  0.082 0.0874 0.1985 0.1214 0.2612 ...
##  $ ProsperRating..numeric.        : int  6 6 3 5 2 4 7 7 4 5 ...
##  $ ProsperScore                   : num  7 9 4 10 2 4 9 11 7 4 ...
##  $ ListingCategory..numeric.      : int  2 16 2 1 1 2 7 7 1 1 ...
##  $ BorrowerState                  : Factor w/ 52 levels "","AK","AL","AR",..: 7 12 25 34 18 6 16 16 22 3 ...
##  $ Occupation                     : Factor w/ 68 levels "","Accountant/CPA",..: 43 52 21 43 50 29 24 24 22 50 ...
##  $ EmploymentStatus               : Factor w/ 9 levels "","Employed",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EmploymentStatusDuration       : int  44 113 44 82 172 103 269 269 300 1 ...
##  $ IsBorrowerHomeowner            : Factor w/ 2 levels "False","True": 1 2 2 2 1 1 2 2 1 1 ...
##  $ CurrentlyInGroup               : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
##  $ GroupKey                       : Factor w/ 707 levels "","00343376901312423168731",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ DateCreditPulled               : Factor w/ 112992 levels "2005-11-09 00:30:04.487000000",..: 111883 64724 85857 100382 72500 73937 97888 97888 53800 100573 ...
##  $ CreditScoreRangeLower          : int  680 800 680 740 680 700 820 820 640 680 ...
##  $ FirstRecordedCreditLine        : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 6617 2247 9498 497 8265 7685 5543 5543 4395 6853 ...
##  $ TotalCreditLinespast7years     : int  29 29 49 49 20 10 32 32 56 29 ...
##  $ OpenRevolvingAccounts          : int  13 7 6 13 6 5 12 12 4 8 ...
##  $ OpenRevolvingMonthlyPayment    : num  389 115 220 1410 214 101 219 219 25 290 ...
##  $ InquiriesLast6Months           : int  3 0 1 0 0 3 1 1 1 1 ...
##  $ TotalInquiries                 : num  5 1 9 2 0 16 6 6 2 4 ...
##  $ CurrentDelinquencies           : int  0 4 0 0 0 0 0 0 1 0 ...
##  $ DelinquenciesLast7Years        : int  0 14 0 0 0 0 0 0 28 0 ...
##  $ PublicRecordsLast10Years       : int  1 0 0 0 0 1 0 0 1 0 ...
##  $ IncomeRange                    : Factor w/ 8 levels "$0","$1-24,999",..: 5 4 3 3 4 4 4 4 6 4 ...
##  $ IncomeVerifiable               : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ StatedMonthlyIncome            : num  6125 2875 9583 8333 2083 ...
##  $ LoanCurrentDaysDelinquent      : int  0 0 0 0 0 0 0 0 0 3 ...
##  $ LoanMonthsSinceOrigination     : int  0 16 6 3 11 10 3 3 22 2 ...
##  $ LoanNumber                     : int  134815 77296 102670 123257 88353 90051 121268 121268 65946 125045 ...
##  $ LoanOriginalAmount             : int  10000 10000 15000 15000 3000 10000 10000 10000 13500 4000 ...
##  $ LoanOriginationDate            : Factor w/ 1873 levels "2005-11-15 00:00:00",..: 1866 1535 1757 1821 1649 1666 1813 1813 1419 1829 ...
##  $ LoanOriginationQuarter         : Factor w/ 33 levels "Q1 2006","Q1 2007",..: 8 32 24 33 16 16 33 33 15 8 ...
##  $ MonthlyLoanPayment             : num  319 321 564 342 123 ...
##  $ LP_CustomerPayments            : num  0 5143 2820 679 1227 ...
##  $ LP_CustomerPrincipalPayments   : num  0 4091 1563 352 604 ...
##  $ LP_InterestandFees             : num  0 1052 1257 327 622 ...
##  $ LP_ServiceFees                 : num  0 -108 -60.3 -25.3 -22.9 ...
##  $ LP_CollectionFees              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_GrossPrincipalLoss          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NetPrincipalLoss            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ LP_NonPrincipalRecoverypayments: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PercentFunded                  : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Recommendations                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsCount     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ InvestmentFromFriendsAmount    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Investors                      : int  1 158 20 1 1 1 1 1 19 1 ...

2.2- What is/are the main feature(s) of interest in your dataset?

  • ListingCreationDate: The date the listing was created.

  • Term: The length of the loan expressed in months.

  • LoanStatus : The current status of the loan: Cancelled, Chargedoff,Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.

  • BorrowerRate: The Borrower’s interest rate for this loan.

  • ProsperRating (numeric): The Prosper Rating assigned at the time the listing was created: 0 - N/A, 1 HR, 2 - E, 3 - D, 4 - C, 5 - B, 6 - A, 7 - AA. Applicable for loans originated after July 2009.

  • ListingCategory : The category of the listing that the borrower selected when posting their listing: 0 Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans .

  • EmploymentStatus: The employment status of the borrower at the time they posted the listing.

  • EmploymentStatusDuration: The length in months of the employment status at the time the listing was created.

  • CurrentlyInGroup: Specifies whether or not the Borrower was in a group at the time the listing was created.

  • LoanOriginalAmount: The origination amount of the loan.

  • LoanOriginationQuarter: The quarter in which the loan was originated.

  • prosperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.

  • IncomeRange : The income range of the borrower at the time the listing was created.

2.3 proposed questions about the selected features:

  1. for the LoanStatus feature , what is the count of each LoanStatus? and distribution of them?
  2. what is the distribution of prosperScore ?
  3. what is the Distribution of BorrowerRate ? 4 what is the ProsperRating distribution?
  4. distribution of loans with the days:
  5. what is the distribution of loans with the months ?
  6. what is the Distribution of ListingCategory?
  7. what is the Distribution of EmploymentStatus?
  8. what is the Distribution EmploymentStatusDuration ?
  9. what is the distribution of IsBorrowerHomeowner?
  10. what is the distribution of CurrentlyInGroup ?
  11. Distribution of LoanOriginalAmount ?
  12. what is the Distribution of LoanOriginationQuarter?
  13. what is the distribution of terms?
  14. what is the distibution of income range ?
  15. What’s the distribution of BorrowerState?

2.4.1- for the LoanStatus feature , what is the count of each LoanStatus?

and distribution of them?

## # A tibble: 11 x 2
## # Groups:   LoanStatus [11]
##    LoanStatus                 n
##    <fct>                  <int>
##  1 Chargedoff              5334
##  2 Completed              19657
##  3 Current                56566
##  4 Defaulted               1005
##  5 FinalPaymentInProgress   205
##  6 Past Due (>120 days)      16
##  7 Past Due (1-15 days)     806
##  8 Past Due (16-30 days)    265
##  9 Past Due (31-60 days)    363
## 10 Past Due (61-90 days)    313
## 11 Past Due (91-120 days)   304

The loan status is most important feautre that is an indicator to the loan sucsess . As a result of status group visualization , the loan status with the highest count is the current status followed by completed status. I notice that the categories are more detailed , so I’m going to group the similar groups togother as the following: Defaulted: Chargedoff, Defaulted,Cancelled , Current : Current , FinalPaymentInProgress, completed for the completed status, and the other values in the Past Due group.

the new groups plotted in the next cell:

After new groups created , it still that current status is the highest group , then completed status then the defualted . In the bivariate exploration step we will plot the groups of completed and defulted with the other features .

2.4.2 - what is the distribution of prosperScore ?

  • The highest prosperScores are between 4 and 8 .

2.4.3- what is the Distribution of BorrowerRate ?

  • BorrowerRate is about to normal distribution with some exeptions.

2.4.4- what is the ProsperRating distribution?

The highest category prosperRating is for C and the lowest one is for AA , note that no N/A category because I’ve already dropped them in the data cleaning step.

Now we will find the distribution of loans among the history (forwarding of days).

2.4.5 - distribution of loans with the days:

## Warning in strptime(xx, ff, tz = "GMT"): unable to identify current timezone 'W':
## please set environment variable 'TZ'

  • we see that the number of loans increases by forwarding the days , that means improvement of company/bank work.

2.4.6 - what is the distribution of loans with the months ?

  • we notice that the number of loans changes among different months. In general some of the summer months (July , Aughest , Septemper) and Janury have number of loans less than the others. On the other hand , spring months (March,April,May) and some of winter months (November and December ) have number of loans higher than the others.

2.4.7- what is the Distribution of ListingCategory?

## [1] '0.7.7'

  • It’s not surprising that the highest category count of loans is debt consolidation ,it seems that it’s count more than the summation count of all other categories!this is because this category indivsuals need money permenantly

2.4.8- what is the Distribution of EmploymentStatus?

*Employed ones have the highest count , this is because the employed have more stability than the others and have more dealing with the banks.

2.4.9- what is the Distribution EmploymentStatusDuration ?

  • note : EmploymentStatusDuration is stored in months , so i divided it by 12 to get the number of years instead and saved it in new column named EmploymentStatusDurationYears.

  • see the following summary of EmploymentStatusDuration(years)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.500   6.167   8.588  12.333  62.917
  • as a result of the previous plots and summary , EmploymentStatusDuration distribution is skewed right! and quarter of employees have duration of 2.5 years and half of them less than 6.167 years duration. That means younger employees(with small employment status duration) have loans more than the others

2.4.10- what is the distribution of IsBorrowerHomeowner?

  • indivisuals with borrower home owner have loans a little bit higher than the others.

2.4.11- what is the distribution of CurrentlyInGroup ?

  • indivisual who not currently in group have loans extremly higher than the others.

2.4.12- Distribution of LoanOriginalAmount ?

- It seems abviously that loan amounts are integers (they have peaks) like , 4000 , 10,000, 15,000 , 20,000 and other integers , and the loans of 4000 have the highest count.

2.4.13- what is the Distribution of LoanOriginationQuarter?

  • This result confirms the result of question 2.5 , which concludes that by forwarding days/years number of loan increases that is a good indicator of improving the bank work!

2.4.14 - what is the distribution of terms?

  • It is abvious that terms(length of loan expressed in months) distributed in three terms : 10,35,60 , and most of terms are with about 35 months.

2.4.15 - what is the distibution of income range ?

  • For the employees , the highest count of loans is for the 50,000 - 74,999 and 25,000 - 49,999 income ranges and the not employed have small count .

2.4.16- What’s the distribution of BorrowerState?

new_df %>% 
    mutate(BorrowerState=BorrowerState  %>%  fct_infreq() )%>%
    ggplot(aes(x=BorrowerState)) + geom_bar()

The state with the highest number of loans is CA then NY,TX,FL . On the other hand, the states with the lowest number of loans are SD , VT , AK and WY.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think year of the loan , IncomeVerifiable ,so i will investegate them in bivariate section .

Did you create any new variables from existing variables in the dataset?

yes ,I created many new variables as the following: 1. In question 2.1 , I grouped the loanStatus in more generic group , I named it by LoanStatusGroup .

  1. I extracted the month from ListingCreatenDate and ploted the distribution of them in 2.5 question.

  2. In question 2.7 , I created a new column named LisitingCategory that mapping the LisitingCategory..numeric. to it’s name , like (0 - Not Available, 1 -
    Debt Consolidation, 2. Home Improvement, …etc).

  3. In question 2.9 , I created a new column named EmploymentStatusDurationYears by dividing EmploymentStatusDuration (months) by 12 .

  4. In Question 2.14 , I created a new column named ProsperRating that mapping the ProsperRating..numeric. to it’s name , like (0 - N/A, 1 - HR, 2 - E, … etc).

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The employedStatusDuration plot is skewed right (not normally distributed) and ProsperScore is near to normal distribution but with no peak! Yes I Performed some operations to tidy data , before I started the visualization and exploration of data , I dropped unnecissary columns and the columns with high number of NA’s, then I droped the remaining NA rows . Then when I started the exploration step , I created many new columns of the existing columns , I explained this step in the previous question.

3 Bivariate Plots Section

In this part , I will explore the correlation coefficient between some numeric variables and visualize the strong relationships(high correlation). For the categorical variables , I will split the categories and explore them. and For the date , i will explore different variables among date.

3.1 - find the correlations:

3.1.1- EmploymentStatusDurationYears and LoanOriginalAmount :

## 
##  Pearson's product-moment correlation
## 
## data:  EmploymentStatusDurationYears and LoanOriginalAmount
## t = 22.851, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07152509 0.08490118
## sample estimates:
##        cor 
## 0.07821665
  • The correlation between EmploymentStatusDuration and LoanOriginalAmount is weak.

3.1.2 - correlation between ProsperRating..numeric. and BorrowerRate .

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and BorrowerRate
## t = -917.27, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9537178 -0.9524851
## sample estimates:
##        cor 
## -0.9531054

Strong negative relation !

3.1.3 - ProsperRating..numeric. and LoanOriginalAmount

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperRating..numeric. and LoanOriginalAmount
## t = 138.17, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4231065 0.4340925
## sample estimates:
##       cor 
## 0.4286153

correlation between ProsperRating..numeric. and LoanOriginalAmount is medium , see the following visualization

3.1.4- EmploymentStatusDurationYears and LoanOriginalAmount

## 
##  Pearson's product-moment correlation
## 
## data:  EmploymentStatusDurationYears and LoanOriginalAmount
## t = 22.851, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07152509 0.08490118
## sample estimates:
##        cor 
## 0.07821665
  • correlation between EmploymentStatusDurationYears and LoanOriginalAmount is too weak.

3.1.5- BorrowerRate and LoanOriginalAmount

## 
##  Pearson's product-moment correlation
## 
## data:  BorrowerRate and LoanOriginalAmount
## t = -132.27, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4190646 -0.4079072
## sample estimates:
##        cor 
## -0.4135014

correlation between BorrowerRate is LoanOriginalAmount medium negative , see the following visualization :

3.1.6- ProsperScore and ProsperRating..numeric.

## 
##  Pearson's product-moment correlation
## 
## data:  ProsperScore and ProsperRating..numeric.
## t = 289.71, df = 84832, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7018241 0.7085893
## sample estimates:
##       cor 
## 0.7052228

as you see, strong correlation between ProsperScore and ProsperRating, see the following visualization.

3.2- now for the categorical features:

3.2.1- prosper rating and loan status:

as a result , HR,E,D ratings have higher prpability of defaulting , in the other hand AA,A,B ratings , have lower proabability of defaulting.

3.2.2- loan status and term(year)

  • loans of 1 year term have higher probability of completing.

3.2.3- ListingCategory and LoanStatusGroup:

  • Green loans , Cosmatics Procedure , Student Use , Business Categories have higher probability of defaulting .

3.2.4- EmploymentStatus and LoanStatusGroup

  • not employeed ,retired empolyment status’s have a higher probabaility of defaulting ,on the other hand employed status has low probability of defaulting.

3.2.5- employed status and rating :

  • also not employeed status has higher number of HR rating (worst rating).

3.2.6- income range and loan status group

## # A tibble: 7 x 2
## # Groups:   IncomeRange [7]
##   IncomeRange        n
##   <fct>          <int>
## 1 $0                45
## 2 $1-24,999       4652
## 3 $100,000+      15202
## 4 $25,000-49,999 24167
## 5 $50,000-74,999 25623
## 6 $75,000-99,999 14496
## 7 Not employed     649

The number of borrowers with 0- salary or not employed is too small , although of that , it’s not surprisng that theis two categories have higher probability of defaulting . In general , lower salary income leads to higher probability of defaulting . Also we can observe that higher salaries have high percentage of current , that means employees with higher salary are increasing in the bank , to ensure that I will explore income range among dates later in this analyisis.

3.2.7- Term and income range

## Warning: position_stack requires non-overlapping x intervals

  • their is no big difference between distributions of income ranges among terms.

3.2.8 - EmploymentStatusDurationYears and LoanStatusGroup

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • loans have a peak at low emplyment duration of years(at about 2- 3 years). And the distribution of loan status groups among the employment status duration mostly do the same Behavior.

now i will extract the year from listingCreationDate to use it in next visualizations.

3.2.9- extract year from listing creatio date

## Warning: Expected 2 pieces. Additional pieces discarded in 84834 rows [1,
## 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

  • It is abviously that number of loans increases each year , you see that year 2014 is lower than 2013 , this is because the recorded data of 2014 only for the first quarter of the year(first three months only ). This is a goog indicator of improving the bank work.

3.2.10 - loan status group and years

The number of defaulted loans decreases by new years and the current groups increases because they are currently created.

3.2.11 - Term and year

  • In the first years it’s abviously that terms of 36 months are dominent, and then the number of loans of 60 month start increasing . On the other hand, loans of 12 month have a very small percentage.

3.2.12- employment status and years

  • Percent of employed status increases instead of full- time status. The difference between full- time and employed is

3.2.13- borrower rate and years

  • the borrower rate in the first years is spare , but in 2013 the rates concentrate about 0.2.To ensure the distribution , I will visualize distribution of borrower rate in each year as the following:

* As I said in the previous plot , the number of loans in early years is small and increasing toward the years . It is abvious that upon 2012 borrower rate is not distributed normally , and In 2013 , 2014 ( first 3 months only ) the distribution is skewed right.

3.2.14- currently in group and years

  • Number of borrowers are currently in group decreases by the years.

3.2.15 - prosper score with loan status group

3.2.16 - BorrowerRate and LoanStatusGroup

  • the borrower rate distribution of defaulted status group is skewed left , that means : higher borrower rate lead to higher probability of defaulting.

3.2.17 - prosper rating with years

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Computation failed in `stat_smooth()`:
## x has insufficient unique values to support 10 knots: reduce k.

  • prosper rating increases by years.

3.2.18 - default percent among employment status

  • You can see that higher income range have higher default percent!

3.2.19 - ProsperRating among income range .

  • again , the lower income ranges have the worst ratings !

3.2.20- IncomeRange vs BorrowerAPR

  • the higher income range have lower borrower APR , On the other hand lower income range have higher borrower APR.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

I investigated the relationship between loanStatusGroup (after grouping them in 4 general groups, completed, defaulted, past due, current ) and 8 other features , as the following : 5 numeric features ( year, EmploymentStatusDurationYears, Term, ProsperScore, BorrowerRate) and 4 categorical features (IncomeRange, EmploymentStatus, ListingCategory,ProsperRating). Also I explored the changing of theis features among the years. As a result , the following features have a relationship with the loan status and can be used as a good indicator of defaulting : prosper score , employed status , income range, prosper rating and borrower rate.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

As in 3.1.2 their is a strong negative relationship between prosperRating and borrowerRate with correlation coefitiont of about - 0.95! And as in 3.1.6 , their is a strong positive relationship between ProsperScore and ProsperRating with correlation coefficient of about 0.705 .

What was the strongest relationship you found?

for the main features of interest , the strongest relationship is between loan status group and prosper score followed by the income range . and for the other features , the strongest one is between prosperRating and borrowerRate with correlation coefficient of about - 0.95 .

4- Multivariate Plots Section

4.1 - prosper rating among years with different loan status groups.

## Warning: Ignoring unknown parameters: binwidth, bins, pad

in the previous plot , I ploted prosper rating in combinations of loan status among years , you can see that for most of defaulted plots are skewed right (negative ratings more) and the completed plots skewed left (positive ratings more).

4.2- ProsperRating vs ProsperScore behaviour among different years and loan

status groups.

We got the same result of 3.1.6 that concludes that prosper rating and prosper score have strong positive relationship.

4.3- prosper rating vs default percent among years.

  • in all years except 2013 and 2014 prosper rating and default percent have strong nagative ralationship , in 2013 and 2014 the relationship is weaker because number of defaulted loans is lower.

4.4- prosper score among years with default percent.

  • the result is similar to previous plot , but here default percent decreases with slope slightly higher than in the prosper rating .

4.5- heatmap of ProsperRating in the distribution of prosperScore among

years.

  • this plot ensure that prosper ratings have a strong relation with prosper score , and each of them increases by years. and that a good indicator of improving the bank work.

4.6- heatmap of BorrowerRate in the distribution of prosperScore vs

ProsperRating .

  • borrowerRate affect prosperRating more than ProsperScore , notice that at high prosperRating(>4) the region is yellow that means lower borrower rate .

4.7 - heatmap of dafault percent in the distribution of prosperScore vs

prosper rating.

  • see that for high values of prosper rating and prosper score (especially for prosper rating) the area color is yellow that means low default percent , In the other hand low values of prosper rating and prosper score (especially for prosper rating) the area color is red that means high default percent.

4.8- default percent heatmap among years.

4.9 - compare heatmaps of default percent in prosperScore among years and

prosperRating among years.

## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine

You can see that default percent of 2013 and 2014 is too small. And we got the same result as we observed in the previous plots for each of prosper score and prosper rating that low prosper scores or low prosper rating means high default percent and high values of them lead to low percent of default percent.

4.10 - heatmap of default ratio among income range vs income verifiable.

  • It’s abvious that the combinantion of IncomeVerifiable with IncomeRange has a high effect at the default percent! In addition to what said previously that low income ranges have high percent of defaulting , the false status of IncomeVerifiable also increase the default percent.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As a result of part 4.6 , I found that loans of high borrower rate (0.3) have a low prosper rating and as in 4.7 the low prosper rate has a higher default percent.Also I found that the the number of defaulted loans decreases by years , see 3.2.10 and 4.8 , and if we connect this observation with the borrower rate , we can see that also by years the borrower rates are decreases as in 3.2.13 and 3.2.16 .

Were there any interesting or surprising interactions between features?

the combinantion of IncomeVerifiable with IncomeRange has a high effect at the default percent! In addition to that low income ranges have high percent of defaulting , the false status of IncomeVerifiable also increase the default percent. - - - - - -

5- Final Plots and Summary

5.1- Plot One: prosper rating and loan status:

Description One:

HR,E,D ratings (worst ratings) have higher prpability of defaulting , in the other hand AA,A,B ratings (best ratings) have lower proabability of defaulting. So we can conclude that lower prosper ratings means higher probability of defaulting.

5.2- Plot Two : heatmap of BorrowerRate in the distribution of prosperScore

vs ProsperRating .

Description Two:

borrowerRate affect prosperRating more than ProsperScore , notice that at high prosperRating(>4) the region is yellow that means lower borrower rate. And each of borrower rate and Prosper rating have a high correlation with the default ratio , so we can say that higher borrower rate means lower prosper rating then highr default probability.

5.3- Plot Three : heatmap of default ratio among income range vs income

verifiable.

Description Three

It’s abvious that the combinantion of IncomeVerifiable with IncomeRange has a high effect at the default percent! In addition to that income ranges have high percent of defaulting , the false status of IncomeVerifiable also increase the default percent.


6- Reflection

  1. IncomeRange: It has a high influence on default loan status , we can say that lower income ranges or not employed status have a higher probability of defaulting. On the other hand , higher income ranges have lower probability of defaulting.

  2. ProsperRating: It has a high influence on default loan status , we can say that lower ProsperRating(HR,E,D) have a higher probability of defaulting. On the other hand , higher ProsperRatings(A,AA,B) have lower probability of defaulting.

  3. BorrowerRate: It has a high influence on default loan status , we can say that higher BorrowerRates leads to lower prosper rating and a higher probability of defaulting. On the other hand , lower borrowerRates have ahigher prosperRating and lower probability of defaulting.

  4. IncomeVerifiable: It has a high influence on default loan status , we can say that loans in the True IncomeVerifiable group have a lower probability of defaulting. On the other hand , loans in the False IncomeVerifiable group have a higher probability of defaulting.

7- References:

  1. udacity videos , lectures and other resources.
  2. R documentation.
  3. https://www.quora.com/How-do-I-extract-month-and-year-from-date-in-R
  4. https://dplyr.tidyverse.org/reference/recode.html